subspace distance
Collaborative and Efficient Fine-tuning: Leveraging Task Similarity
Magakyan, Gagik, Reisizadeh, Amirhossein, Park, Chanwoo, Parrilo, Pablo A., Ozdaglar, Asuman
Adaptability has been regarded as a central feature of foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameter-efficient fine-tuning methods such as the celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality, and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks should be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that individual performance is significantly boosted when models are trained together with similar tasks.
On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning
Zhang, Thomas T., Moniri, Behrad, Nagwekar, Ansh, Rahman, Faraz, Xue, Anton, Hassani, Hamed, Matni, Nikolai
Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, linear representation learning and single-index learning, which are widely used to study how typical algorithms efficiently learn useful features to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
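To make the contrast between per-axis and entry-wise preconditioning concrete, here is a minimal sketch of one update of each kind for a single weight matrix. It follows the well-known Shampoo recipe (left and right statistics raised to the power -1/4), which is only one member of the layer-wise family and is not necessarily the variant analyzed in the paper. The function names, learning rate, and damping value `eps` are assumptions made for this illustration.

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    """Return (M + eps*I)^(-1/p) for a symmetric PSD matrix M via eigendecomposition."""
    w, Q = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return (Q * w ** (-1.0 / p)) @ Q.T

def layerwise_step(W, G, L, R, lr=0.1):
    """One Shampoo-style step: one preconditioner per axis of the weight matrix."""
    L = L + G @ G.T          # statistics for the row (output) axis
    R = R + G.T @ G          # statistics for the column (input) axis
    W = W - lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return W, L, R

def diagonal_step(W, G, v, lr=0.1, eps=1e-8):
    """Entry-wise (diagonal) preconditioning, Adagrad-style, for comparison."""
    v = v + G ** 2           # one accumulator per weight entry
    W = W - lr * G / (np.sqrt(v) + eps)
    return W, v

# Hypothetical usage on a 4x3 weight matrix with a random gradient.
rng = np.random.default_rng(0)
W0 = np.zeros((4, 3))
G = rng.standard_normal((4, 3))

W_layer, L, R = layerwise_step(W0, G, np.zeros((4, 4)), np.zeros((3, 3)))
W_diag, v = diagonal_step(W0, G, np.zeros((4, 3)))

# Singular values of the (unscaled) layer-wise update direction.
update_sv = np.linalg.svd((W0 - W_layer) / 0.1, compute_uv=False)
```

On the very first step, with empty statistics, the layer-wise update direction has singular values close to 1, so it acts on the gradient's spectrum as a whole, whereas the diagonal update rescales each entry independently.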
ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework
Zhang, Hengyuan, Shang, Chenming, Wang, Sizhe, Zhang, Dongdong, Yao, Feng, Sun, Renliang, Yu, Yiyao, Yang, Yujiu, Wei, Furu
Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance the multilingual capabilities of LLMs, they still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.
On the Adversarial Robustness of Subspace Learning
Li, Fuwei, Lai, Lifeng, Cui, Shuguang
In this paper, we study the adversarial robustness of subspace learning problems. Different from the assumptions made in existing work on robust subspace learning where data samples are contaminated by gross sparse outliers or small dense noises, we consider a more powerful adversary who can first observe the data matrix and then intentionally modify the whole data matrix. We first characterize the optimal rank-one attack strategy that maximizes the subspace distance between the subspace learned from the original data matrix and that learned from the modified data matrix. We then generalize the study to the scenario without the rank constraint and characterize the corresponding optimal attack strategy. Our analysis shows that the optimal strategies depend on the singular values of the original data matrix and the adversary's energy budget. Finally, we provide numerical experiments and practical applications to demonstrate the efficiency of the attack strategies.
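The quantity being attacked can be illustrated with a short sketch that measures how far the learned subspace moves when a rank-one perturbation is added to the data matrix. Everything specific here is an assumption for illustration: the projection (Frobenius) distance is one common subspace distance and may differ from the paper's, the perturbation direction is random rather than the optimal attack the paper characterizes, and the names `top_subspace`, `subspace_distance`, and `budget` are hypothetical.

```python
import numpy as np

def top_subspace(X, k):
    """Orthonormal basis of the top-k left singular subspace of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]

def subspace_distance(U, V):
    """Projection distance ||U U^T - V V^T||_F between span(U) and span(V)."""
    return np.linalg.norm(U @ U.T - V @ V.T, ord="fro")

rng = np.random.default_rng(0)
d, m, k = 20, 100, 2

# Data matrix: rank-k signal plus small dense noise.
X = rng.standard_normal((d, k)) @ rng.standard_normal((k, m))
X = X + 0.1 * rng.standard_normal((d, m))

# Rank-one modification with Frobenius norm equal to the energy budget.
# The direction is arbitrary here, NOT the optimal attack.
budget = 5.0
a = rng.standard_normal(d)
a /= np.linalg.norm(a)
b = rng.standard_normal(m)
b /= np.linalg.norm(b)
delta = budget * np.outer(a, b)

U_clean = top_subspace(X, k)
U_attacked = top_subspace(X + delta, k)
dist = subspace_distance(U_clean, U_attacked)
```

For two k-dimensional subspaces this distance lies between 0 (identical subspaces) and sqrt(2k) (mutually orthogonal subspaces), which gives a scale for judging how damaging a given perturbation is.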
Compressed Subspace Learning Based on Canonical Angle Preserving Property
Jiao, Yuchen, Li, Gen, Gu, Yuantao
A standard way to tackle the challenging task of learning from high-dimensional data is to exploit its underlying low-dimensional structure. Union of Subspaces (UoS) is a popular and powerful model to describe such structure, which assumes that the data lies in the union of a collection of low-dimensional subspaces. Extracting useful information from the UoS structure of data has become the task of the newly emerged field of subspace learning. In this paper, we investigate how random projection, an efficient and commonly used method for dimensionality reduction, distorts the UoS structure of data. Here the fine details of UoS structure are described in terms of canonical angles (also known as principal angles) between subspaces, which are a well-known characterization of relative subspace positions by a sequence of angles. It is proved that random projection with the so-called Johnson-Lindenstrauss (JL) property approximately preserves canonical angles between subspaces. As canonical angles completely determine the relative position of subspaces, our result indicates that random projection approximately preserves the structure of a union of subspaces. Inspired by this result, we propose in this paper the framework of Compressed Subspace Learning (CSL), which enables the extraction of useful information from the UoS structure of data in a greatly reduced dimension and has the advantage of lower computational cost and memory requirements. We demonstrate the effectiveness of CSL in various subspace-related tasks such as subspace visualization, active subspace detection, and subspace clustering.
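The angle-preservation claim can be checked numerically with a small sketch that computes canonical angles between two subspaces before and after a random projection. The scaled Gaussian matrix used as the JL map, the dimensions, the chosen angles, and the function name `principal_angles` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def principal_angles(A, B):
    """Canonical angles (radians, ascending) between span(A) and span(B)."""
    Qa, _ = np.linalg.qr(A)                          # orthonormal basis of span(A)
    Qb, _ = np.linalg.qr(B)                          # orthonormal basis of span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

rng = np.random.default_rng(0)
d, n, k = 2000, 500, 3                               # ambient dim, reduced dim, subspace dim
theta = np.array([0.3, 0.8, 1.3])                    # canonical angles built in by construction

# U spans e_1..e_k; column i of V is e_i rotated by theta[i] toward e_{k+i}.
U = np.zeros((d, k))
V = np.zeros((d, k))
for i in range(k):
    U[i, i] = 1.0
    V[i, i] = np.cos(theta[i])
    V[k + i, i] = np.sin(theta[i])

# Gaussian random projection from d to n dimensions.
P = rng.standard_normal((n, d)) / np.sqrt(n)

angles_before = principal_angles(U, V)
angles_after = principal_angles(P @ U, P @ V)
max_change = np.max(np.abs(angles_after - angles_before))
```

Here `angles_before` recovers the angles built into the construction, and `max_change` gives a rough sense of how much a fourfold reduction in dimension distorts them in this particular instance.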